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Abstract 

We consider a situation in which we see samples Xn G drawn i.i.d. from some 
distribution with mean zero and unknown covariance A. We wish to compute the 
top eigenvector of A in an incremental fashion - with an algorithm that maintains 
an estimate of the top eigenvector in 0{d) space, and incrementally adjusts the 
estimate with each new data point that arrives. Two classical such schemes are 
due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates 
for both. 


1 Introduction 


Principal component analysis (PCA) is a popular form of dimensionality reduction that projects a 
data set on the top eigenvector(s) of its covariance matrix. The default method for computing these 
eigenvectors uses 0{(P) space for data in which can be prohibitive in practice. It is therefore 
of interest to study incremental schemes that take one data point at a time, updating their estimates 
of the desired eigenvectors with each new point. For computing one eigenvector, such methods use 
0{d) space. 

For the case of the top eigenvector, this problem has long been studied, and two elegant solutions 
were obtained by Krasulina Q and Oja i). Their methods are closely related. At time n — 1, they 
have some estimate Vn-i G of the top eigenvector. Upon seeing the next data point, X„, they 
update this estimate as follows: 


Fn — Fn—1 “f Tri 


- 




llK-1 


Id V„_1 


K = 


K_i + 7 „A„AJU„_i 
||K_i+7„A„AJV;_i| 


(Krasulina) 


(Oja) 


Here 7 „ is a “learning rate” that is typically proportional to 1 /n. 

Suppose the points Xi,X 2 , ■ ■ ■ are drawn i.i.d. from a distribution on with mean zero and co- 
variance matrix A. The original papers proved that these estimators converge almost surely to the 
top eigenvector of A (call it v*) under mild conditions: 


• 7" = oo while J2u 

• If Ai, A 2 denote the top two eigenvalues of A, then Ai > A 2 . 

• E||A„||^ < 00 for some suitable k (for instance, k = 8 works). 


There are also other incremental estimators for which convergence has not been established; see, for 
instance, HU and di. 

In this paper, we analyze the rate of convergence of the Krasulina and Oja estimators. They can 
be treated in a common framework, as stochastic approximation algorithms for maximizing the 
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Rayleigh quotient 


G{v) = 




The maximum value of this function is Ai, and is achieved at v* (or any nonzero multiple thereof). 
The gradient is 


VG(t;) = 




Since = A, we see that Krasulina’s method is stochastic gradient descent. The Oja proce¬ 

dure is closely related: as pointed out in ifTOl . the two are identical to within second-order terms. 


Recently, there has been a lot of work on rates of convergence for stochastic gradient descent (for in¬ 
stance, ini), but this has typically been limited to convex cost functions. These results do not apply 
to the non-convex Rayleigh quotient, except at the very end, when the system is near convergence. 
Most of our analysis focuses on the buildup to this finale. 


We measure the quality of the solution 14, at time n using the potential function 




(K • v*f 


where v* is taken to have unit norm. This quantity lies in the range [0,1], and we are interested in 
the rate at which it approaches zero. The result, in brief, is that E[T'„] = 0(l/n), under conditions 
that are similar to those above, but stronger. In particular, we require that 7 „ be proportional to 1 /n 
and that ||X„|| be bounded. 


1.1 The algorithm 

We analyze the following procedure. 

1. Set starting time. Set the clock to time rio. 

2. Initialization. Initialize Vn„ uniformly at random from the unit sphere in 

3. For time n = rio -f 1, rio + 2,...: 

(a) Receive the next data point, X„. 

(b) Update step. Perform either the Krasulina or Oja update, with 7 „ = c/n. 

The first step is similar to using a learning rate of the form 7 „ = c/(n + Uo), as is often done in 
stochastic gradient descent implementations m. We have adopted it because the initial sequence of 
updates is highly noisy: during this phase 14 , moves around wildly, and cannot be shown to make 
progress. It becomes better behaved when the step size 7 „ becomes smaller, that is to say when n 
gets larger than some suitable Uo- By setting the start time to no, we can simply fast-forward the 
analysis to this moment. 


1.2 Initialization 

One possible initialization is to set V4„ to the first data point that arrives, or to the average of a few 
data points. This seems sensible enough, but can fail dramatically in some situations. 

Here is an example. Suppose X can take on just 2d possible values: ±ei, ±tTe 2 ,..., icrcd, where 
the Ci are coordinate directions and 0 < ct < 1 is a small constant. Suppose further that the 
distribution of X is specified by a single positive number p < 1: 

Pr(X = ei) = Pr(X = -ei) 

Pr(X = crci) = Pr(X = -aa) 

Then X has mean zero and covariance diag(p, a'^{1 — p)/{d — 1),..., a'^{1 — p)/{d — 1)). We will 
assume that p and tr are chosen so that p > a^{l — p)/{d— 1); in our notation, the top eigenvalues 
are then Ai = p and A 2 = — p)/{d — 1), and the target vector is v* = ei. 


P 

2 


1 -P 
2{d-l) 


for z > 1 
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If Vn is ever orthogonal to some Cj, it will remain so forever. This is because both the Krasulina and 
Oja updates have the following properties: 

Vn-l • = 0 ^ Vn= K-1 

Vri—1 ‘ ^ 0 —^ ^ span(Vj^_i, Xji). 

If 14,0 is initialized to a random data point, then with probability 1 — p, it will be assigned to some 
Ci with i > 1, and will converge to a multiple of that same e* rather than to Ci. Likewise, if it is 
initialized to the average of < 1/p data points, then with constant probability it will be orthogonal 
to ei and remain so always. 

Setting 14,0 to a random unit vector avoids this problem. However, there are doubtless cases, for 
instance when the data has intrinsic dimension ^ d, in which a better initializer is possible. 

1.3 The setting of the learning rate 

In order to get a sense of what rates of convergence we might expect, let’s return to the example of a 
random vector X with 2d possible values. In the Oja update 14 = 14-i + InXnX'^Vn-i, we can 
ignore normalization if we are merely interested in the progress of the potential function Since 
the Xn correspond to coordinate directions, each update changes just one coordinate of V: 

Xji = dici V 14,1 — 14— 1 , 1(1 + Tn) 

Xji — 'dlCTCi V Vn,i — l/i —l,i(l T ^ Tn) 

Recall that we initialize 14„ to a random vector from the unit sphere. For simplicity, let’s just 
suppose that rio = 0 and that this initial value is the all-ones vector (again, we don’t have to worry 
about normalization). On each iteration the first coordinate is updated with probability exactly 
p = Ai, and thus 

E[14.i] = (1 + Ai7i)(1 + A172) • • • (1 + Ai7„) ~ exp(Ai( 7 i H-h 7 „)) ~ 

since 7 „ = c/n. Likewise, for i > 1, 

E[14,i] = (1 + A27 i)(1 + A272) • • • (1 + A27„) ~ 

If all goes according to expectation, then at time n, 

„2cAi — 1 

_ ^ _ '1 _ _ _ 

||14P + (d - l)n2^^=^ „2 c(Ai-A2)- 

(This is all very rough, but can be made precise by obtaining concentration bounds for lnl4 i.) 
From this, we can see that it is not possible to achieve a 0(l/n) rate unless c > l/(2(Ai — A 2 )). 
Therefore, we will assume this when stating our final results, although most of our analysis is in 
terms of general 7 „. An interesting practical question, to which we do not have an answer, is how 
one would empirically set c without prior knowledge of the eigenvalue gap. 

1.4 Nested sample spaces 

For n > Uo, let J4 denote the sigma-field of all outcomes up to and including time n: J4 = 
a{Vn„, Xn^+i,..., X„). We start by showing that 

E[^„|< T-„_i(l - 27 „(Ai - A2)(1 - + 0 ( 7 "). 

Initially is likely to be close to 1. For instance, if the initial Vn„ is picked uniformly at random 
from the surface of the unit sphere in then we’d expect 'I'„^ « 1 — 1/d. This means that the 
initial rate of decrease is very small, because of the (1 — term. 

To deal with this, we divide the analysis into epochs: the first takes from 1 — l/d to 1 — 2 /d, the 
second from 1 — 2/dtol — A/d, and so on until T'n finally drops below 1/2. We use martingale large 
deviation bounds to bound the length of each epoch, and also to argue that does not regress. In 
particular, we establish a sequence of times rij such that (with high probability) 

21 

sup < 1 - -j. (1) 

n>nj ^ 
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The analysis of each epoch uses martingale arguments, but at the same time, assumes that T'n re¬ 
mains bounded above. Combining the two requires a careful specification of the sample space at 
each step. Let denote the sample space of all realizations {vn„, Xn„+i, Xn„+ 2 , ■ ■ ■)’ P the 
probability distribution on these sequences. For any (5 > 0, we define a nested sequence of spaces 
D • • • such that each is J^„_i-measurable, has probability F’(nJi) > 1 — (5, 
and moreover consists exclusively of realizations oj G that satisfy the constraints ([l]i up to and 
including time n — 1. We can then build martingale arguments by restricting attention to when 
computing the conditional expectations of quantities at time n. 


1.5 Main result 


We make the following assumptions; 

(Al) The Xn G are i.i.d. with mean zero and covariance A. 
(A2) There is a constant B such that ||Xn|P < B. 

(A3) The eigenvalues Ai > A 2 > • • • > Ad of A satisfy Ai > A 2 . 
(A4) The step sizes are of the form 7 ^ = c/n. 


Under these conditions, we get the following rate of convergence for the Krasulina update. 

Theorem 1.1. There are absolute constants Ao,Ai > 0 and 1 < a < 4: for which the following 
holds. Pick any 0 < i5 < 1, and any Co > 2. Set the step sizes to 7 „ = c/n, where c = Co/(2(Ai — 
A 2 )), and set the starting time to Uq > {AoB^c^d^/d^) ln(l/(5). Then there is a nested sequence of 
subsets of the sample space D D A ■■ ■ such that for any n > Uq, we have: 




P{^n) > 1 — <) and 
{Vn■v*y^ , , 


> 1 - 


1 


L IIKIP J - V 2(Co-2) J n+I 

where E„ denotes expectation restricted to 17^. 




Uq + I 
n + 1 


Co/2 


Since Co > 2, this bound is of the form E„ [T'ji] = 0{l/n). 

The result above also holds for the Oja update up to absolute constants. 

We also remark that a small modification to the final step in the proof of the above yields a rate of 
E„ for Co < 2, with an identical definition of E„ The details are in the 

proof, in Appendix D.2. 


1.6 Related work 

There is an extensive line of work analyzing PCA from the statistical perspective, in which the con¬ 
vergence of various estimators is characterized under certain conditions, including generative models 
of the data m and various assumptions on the covariance matrix spectrum mil and eigenvalue 
spacing mi. Such works do provide finite-sample guarantees, but they apply only to the batch case 
and/or are computationally intensive, rather than considering an efficient incremental algorithm. 

Among incremental algorithms, the work of Warmuth and Kuzmin ifTSl describes and analyzes 
worst-case online PCA, using an experts-setting algorithm with a super-quadratic per-iteration cost. 
More efficient general-purpose incremental PCA algorithms have lacked finite-sample analyses m. 
There have been recent attempts to remedy this situation by relaxing the nonconvexity inherent in 
the problem El or making generative assumptions 0 . The present paper directly analyzes the oldest 
known incremental PCA algorithms under relatively mild assumptions. 


2 Outline of proof 

We now sketch the proof of Theorem ll.il almost all the details are relegated to the appendix. 

Recall that for n > no, we take to be the sigma-field of all outcomes up to and including time n, 
that is, J~Yi — 7 ^Uo + l ; ■ • ■ ; ■ 
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An additional piece of notation: we will use u to denote u/||u||, the unit vector in the direction of 
u e Thus, for instance, the Rayleigh quotient can be written G{v) = v'^Av. 

2.1 Expected per-step change in potential 

We first bound the expected improvement in in each step of the Krasulina or Oja algorithms. 
Theorem 2.1. For any n > Uo, we can write < 'hn-i + Pn — Zn, where 



and where Zn is a Tn-measurable random variable with the following properties: 



• \Zn\< ^InB. 

The theorem follows from Lemmas IA.4I and IA.5I in the appendix. Its characterization of the two 
estimators is almost identical, and for simplicity we will henceforth deal only with Krasulina’s 
estimator. All the subsequent results hold also for Oja’s method, up to constants. 

2.2 A large deviation bound for T'n 

We know from Theorem 12.II that < T'„_i + /3n — Zn, where /3„ is non-stochastic and Zn is 

a quantity of positive expected value. Thus, in expectation, and modulo a small additive term, 
decreases monotonically. However, the amount of decrease at the nth time step can be arbitrarily 
small when is close to 1. Thus, we need to show that is eventually bounded away from 1, 
i.e. there exists some Co > 0 and some time Uq such that for any n > Uo, we have T'ji < 1 — Cq. 

Recall from the algorithm specification that we advance the clock so as to skip the pre-no phase. 
Given this, what can we expect to to be? If the initial estimate Vn^ is a random unit vector, then 
= 1 — 1/d and, roughly speaking, Pr(T'jj^ > 1 — e/d) = 0{-^). If Uo is sufficiently large, 
then T'n may subsequently increase a little bit, but not by very much. In this section, we establish 
the following bound. 

Theorem 2.2. Suppose the initial estimate is chosen uniformly at random from the surface of 
the unit sphere in Assume also that the step sizes are of the form 7 „ = c/n, for some constant 
c > 0. Then for any 0 < e < 1, if Uo > 2B^c^d^/e^, we have 



Pr sup 


To prove this, we start with a simple recurrence for the moment-generating function of 

Lemma 2.3. Consider a filtration {Tn) and random variables Yn, Zn & J^n such that there are two 
sequences of nonnegative constants, {fin) and {(/n), for which: 


* Yn Yn-l Y fin — Zn- 

• Each Zn takes values in an interval of length C,n- 

Then for any t > 0, we have | < exp(f(y„_i — E[Z„|J^„_i] + fin +fCn/8))- 


This relation shows how to define a supermartingale based on , from which we can derive a 
large deviation bound on Yn- 

Lemma 2.4. Assume the conditions of Lemma \2.3\ and also that E[Z„|J^„_i] > 0. Then, for any 
integer m and any A,t > 0, 
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In order to apply this to the sequence we need to first calculate the moment-generating func¬ 

tion of its starting value 

Lemma 2.5. Suppose a vector V is picked uniformly at random from the surface of the unit sphere 
in where d > 3. Define Y = 1 — (V]^)/||1^|P- Then, for any t > 0, 

- M 2t 


Putting these pieces together yields Theorem l2.2l 


2.3 Intermediate epochs of improvement 

We have seen that, for suitable e and rio, it is likely that < 1 — e/d for all n > Uq- We now 
define a series of epochs in which 1 — successively doubles, until 'I'n finally drops below 1/2. 

To do this, we specify intermediate goals (rio, Co), (ni, ei), (^ 2 , £ 2 )) • ■ •, {nj, ej), where Uo <ni < 
■ ■ ■ < nj and Co < ei < • • • < ej = 1/2, with the intention that: 

For all 0 < j < J, we have sup < 1 — . (2) 

n>nj 

Of course, this can only hold with a certain probability. 

Let O denote the sample space of all realizations a:„„+ 2 ,...), and P the probability 

distribution on these sequences. We will show that, for a certain choice of {(rij, ej)}, all J -f 1 
constraints dD can be met by excluding just a small portion of O. 

We consider a specific realization a; G O to be good if it satisfies (|2|i. Call this set O': 

O' = {w G O : sup < 1 — Cj for all 0 < j < Jj. 

n>nj 


For technical reasons, we also need to look at realizations that are good up to time n—1. Specifically, 
for each n, define 

0)j = {w G O : sup 'kf (w) < 1 — Cj for all 0 < j < J}. 

rij <£<n 

Crucially, this is .F„_i-measurable. Also note that O' = nn>ra 

We can talk about expectations under the distribution P restricted to subsets of O. In particular, let 
Pn be the restriction of P to O^; that is, for any A C O, we have Pn{A) = P{A fl 0)^)/P(O^). As 
for expectations with respect to Pn, for any function / : O —R, we define 

\ [ f{^)P{duj)- 

Jn' 


Here is the main result of this section. 


Theorem 2.6. Assume that 7 „ = c/n, where c = Co/(2(Ai — A 2 )) and Cq > 0. Pick any 0 < i5 < 1 
and select a schedule (rio, eo), ■ ■ ■, (nj, ej) that satisfies the conditions 


forO < j < J, and ej_i < j 
(nj+i + 1) > e^P‘’(nj -I- 1) for 0 < j < J 


(3) 


as well as Uq > (2i)c^/e/f) ln(4/(5). Then Pr(O') > 1 — A 


The first step towards proving this theorem is bounding the moment-generating function of T'n in 
terms of that of 'l'„_i. 

Lemma 2.7. Suppose n > rij. Suppose also that yn = c/n, where c = Co/(2(Ai — A 2 )). Then for 
any t > 0, 


<E„ 


exp 



1 



-4}}^-j- 
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We would like to use this result to bound IE„ in terms of [^'m] for m < n. The shift in 
sample spaces is easily handled using the following observation. 

Lemma 2.8. If g : R —^ R ii nondecreasing, then E„[p(T'„_i)] < E„_i[p(T'„_i)]/or any n > Uo- 
A repeated application of Lemmas I2.7l and l23] vields the following. 


Lemma 2.9. Suppose that conditions @ hold. Then for 0 < j < J and any t > 0, 



Now that we have bounds on the moment-generating functions of intermediate T'„, we can apply 
martingale deviation bounds, as in Lemma 1231 to obtain the following, from which Theorem 12.61 
ensues. 

Lemma 2.10. Assume conditions hold. Pick any Q < i5 < 1, and set no > (20c^i?^/eo) ln(4/(5). 
Then 



2.4 The final epoch 

Recall the definition of the intermediate goals {nj,ej) in (l2]l, Q. The final epoch is the period 
n > nj, at which point < 1/2. The following consequence of Lemmas IA.4I and 12.81 captures 
the rate at which 'k decreases during this phase. 

Lemma 2.11. For all n > nj, 


Enl^'n] < (1 - an)E„_i[T'„_i] +^„, 

where = (Ai - A 2 ) 7 n and /3„ = 


By solving this recurrence relation, and piecing together the various epochs, we get the overall 
convergence result of Theorem ll.ll 

Note that Lemma l2. 11 I closelv resembles the recurrence relation followed by the squared L? distance 
from the optimum of stochastic gradient descent (SGD) on a strongly convex function am . As 
0, the incremental PCA algorithms we study have convergence rates of the same form as 
SGD in this scenario. 

3 Experiments 

When performing PCA in practice with massive d and a large/growing dataset, an incremental 
method like that of Krasulina or Oja remains practically viable, even as quadratic-time and -memory 
algorithms become increasingly impractical. Arora et al. El have a more complete discussion of 
the empirical necessity of incremental PCA algorithms, including a version of Oja’s method which 
is shown to be extremely competitive in practice. 

Since the efficiency benefits of these types of algorithms are well understood, we now instead focus 
on the effect of the learning rate on the performance of Oja’s algorithm (results for Krasulina’s are 
extremely similar). We use the CMU PIE faces flSl . consisting of 11554 images of size 32 x 32, 
as a prototypical example of a dataset with most of its variance captured by a few PCs, as shown in 
Fig. 1. We set no = 0. 

We expect from Theorem 1 1.1 1 and the discussion in the introduction that varying c (the constant in 
the learning rate) will influence the overall rate of convergence. In particular, if c is low, then halving 
it can be expected to halve the exponent of n, and the slope of the log-log convergence graph (ref. 
the remark after Thm. fO . This is exactly what occurs in practice, as illustrated in Fig. 2. The 
dotted line in that figure is a convergence rate of 1 /n, drawn as a guide. 
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Figures 1 and 2. 


4 Open problems 

Several fundamental questions remain unanswered. First, the convergence rates of the two incre¬ 
mental schemes depend on the multiplier c in the learning rate 7 „. If it is too low, convergence will 
be slower than 0{l/n). If it is too high, the constant in the rate of convergence will be large. Is 
there a simple and practical scheme for setting cl 

Second, what can be said about incrementally estimating the top p eigenvectors, for p > 1? Both 
methods we consider extend easily to this case cni; the estimate at time n is a d x p matrix Vn 
whose columns correspond to the eigenvectors, with the invariant V^Vn — Ip always maintained. 
In Oja’s algorithm, for instance, when a new data point G arrives, the following update is 
performed: 

Wn = Vn-l+lnXr,X^Vn-l 
Vn = orth(fF„) 

where the second step orthonormalizes the columns, for instance by Gram-Schmidt. It would be 
interesting to characterize the rate of convergence of this scheme. 

Finally, our analysis applies to a modified procedure in which the starting time rio is artificially set 
to a large constant. This seems unnecessary in practice, and it would be useful to extend the analysis 
to the case where Uq = 0. 
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A Expected per-step change in potential 

A.1 The change in potential of Krasulina’s update 

Write Krasulina’s update equation as 

V^n — ^n—1 '^n^n 

in = (XnXl - V^_,XnX^Vn-lId) K-l 

We start with some basic observations. 

Lemma A.l. For all n > Uo, 

(a) in is orthogonal to Vn-i- 

(b) \\in\?<B^Vn-rriA. 

(c) = AVn-l - G(K-i)K-1- 

(d) ||y„|| > ||F„-i||. 

Proof. For (a), let denote the component of A„ orthogonal to 14,-1- Then 

= {Vn-l-Xn)Xn-{Vn-l-XnfVn-l = (4„_1 • A„) (A„ - (14_i • A„) ) = (14_i • A„) . 

For (b), note from the previous formulation that II IP = (t4-i-A„)^||Ap|p < ||t4-i|P||A„|p/4. 
Part (c) follows directly from E[A„Aj| J4_i] = A. 

For (d), we use ||4„|p = ||14-i + luinW^ = IIK-if + llWinf > IlK-ilP- □ 

We now check that (14 • grows in expectation with each iteration. 

Lemma A.2. For any n > rio, we have 

(a) (14 • > (14-1 • + 27„(14_i • v*){in ■ v*). 

(b) E[^„ • V*\Xn-l] = {Vn-l ■ v*){Xi - G(K-i)). 


Proof. Part (a) follows directly from the update rule: 

{Vn ■ V*f = {{Vn-l ■ V*) + yn{in ' V*)f > {Vn-l ' V*f + 27„(14-1 ' V^i^n ' V*). 

Part (b) follows by substituting the expression for E[^„ | A„_i] from Lemma I aTTI c): 

ElCn • V*\Fn-l] = {Vj_,Av*) - G{Vn-l){Vn-l ■ V*) = Ai(14_i • V*) - G{Vn-l){Vn-l ■ V*). 

□ 


In order to use Lemma lAi2l to bound the change in potential '^n, we need to relate to the quantity 
Ai - G{Vn). 

Lemma A.3. For any n > Uo, we have Ai — G(14) > (Ai — A 2 )'I'„. 


Proof. It is easiest to think of 14 in the eigenbasis of A\ the component of 14 in direction v* is 
14 • u*, and the orthogonal component is Vn = 14 — (14 • v*)v* . Then 

1 _ VnAVn {Vn ■ V*f , , {V^^AV^ ^ Ai(K • + A 2 IIKPIP 

^[Vn) — IIT^ 119 — — 11-^^ 119 —M H- -- - ^ 


IIK^IP 


IIKIP 


IIKII 


2 — 


IIK. 


Therefore, 

Ai—G(14) > Ai — 


Ai(14-i;*)2 + A2(||K|P-(14.u*)2) 

IIV4IP 


— (Ai—A 2 ) ( 1 — 


{Vn ■ Vy 

IIV4IP 


— (Ai- 

□ 


A2)«'n 
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We can now explicitly bound the expected change in in each iteration. 

Lemma A.4. For any n > Uq, we can write < + /3„ - Zn, where /3„ = 7^5^74 and 

where 

Zn = 27n(K-l • V*){£,n ' V*)/\\Vn-lf 
is a Tn-measurable random variable with the following properties: 

• E[Z„|J-„_i] = 27 „(F„_i • v*)\Xi - G{Vn-i)) > 27„(Ai - A2)^„-i(l - ^„-i) > 0. 

• \Zn\< ^InB. 


Proof. Using Lemmas lATI and lA.il' al. 

^ \\Vnr-{Vn-V*f ^ \\Vn-ir+llUnr-{Vn-V*f 
IIKP - IIK-lP 

< 1 + i^2^2 _ jVn • V*? 

/ , 1 2d2 iVn-l-V*)"^ +2y„{Vn-l-V*){^n-V*) 

+ -P47P- 

= U/ _|_l-^2n2_« (Un-1 ■ V*){^n ' V*) 

n-l+47n In 

whichis ^!n-i+Pn—Zn- The Conditional expectation of Z^ can be determined from Lemma lA.2r b): 

E[Z„|J-„_i] = • v*\Tn-i] = 27n(i4-i • - G(14_i)) 

II Vn-l\\ 

and this can be lower-bounded using Lemma lA3] 

Finally, we need to determine the range of possible values of By expanding we get 

= 27„(U„_i • V*) ((A„ • u*)(A„ • U„_i) - (i4_i • u*)(A„ • . 

Since ||A„p < B, we see that must lie in the range ±47„i3. □ 


A.2 The change in potential of the Oja update 

Recall the Oja update: 


^ Vn-l + 7nA„AjU„_i 

” ||14_i+7„A„AjU„_i|r 

Since our bounds are on the potential function T'„, which is insensitive to the length of U„, we can 
skip the normalization, and instead just consider the update rule 

Kt = Kj_1 -f 7nA„XjU„_i. 


The final bounds, as well as many of the intermediate results, are almost exactly the same as for 
Krasulina’s estimator. Here is the analogue of Lemma lA.4l 

Lemma A.5. For any n > Uo, we can write < 'Fn-i — Zn + fin, where Zn is the same as in 
Lemma WM and /3„ = + 2j!fB^. 


Proof This is a series of calculations. First, 

(K • V*f = {{Vn-l • V*)+yn{V^-^XnXlv*)f 

> {Vn-l ■ V*f + 27„(U„_i • v*){V^_^XnXlv*). 

Similarly, 

||14f = \\Vn-l+lnXnXlVn-lf 

= \\Vn-lf+ll\\XnXlVn-lf+2^n{Vn-l-Xnf 

< ||K_if (1 + 7^52 + 27„(U„_i • Xnf) 
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where we have used ||^n|P < B. Combining these, 

{Vn • V*? > {Vn -1 - V*)^ + 274K-1 ■ V*){Vj_,XnX^V*) 
l|K^P - ||K_i||2(1 + 7252 + 27„(K_i.X„)2) 

^ {Vn -1 • V*f + 27„(y„-i ■ v*){Vj_,X^X^v*) 

1 + 72 S 2 + 27 „(C„_i.X „)2 

> ((K_1 • v*)^ + 27„(i4_i ■ v*)iVj_,X^X^v*)) (l - InB^ - 27n(K_i • X^f^ 

> (K_1 • v*)^ + 27„(F„_i • u*) (v^_^X„X^v* - (y„_i • U*)(F„_1 • X„)2) - 57^52 _ 

where the final step involves some extra algebra that we have omitted. The lemma now follows by 
invoking = 1 — (V„ ■ v*)"^. □ 


B A large deviation bound for 

B.l Proof of Lemma 1231 


For any t > 0, 


E <E 


= e 


7(y„_i+/3„-z„)| 




= E 


^ — tE[Zn I ^ —E[Zn|^n-l]) 


n—1 




^-t{Zr^-nZr^\Tr^-l\) 


n—1 


We bound the last expected value using Hoeffding’s lemma: E[e*'^] < “)^/8 for any random 

variable W of mean zero and range [a, 6]. 


B.2 Proof of Lemma 12,41 

By Lemma 123] 

E J'n-l] < exp ^tYn-l + t/3n + 

Now let’s define an appropriate martingale. Let + and let = exp(f(y„ + 

Tji)). Thus Mn € Tn, and 


E[M„|J3_i] = E[e*'"" \Xn-i] exp(tr„) < exp ( tYn-i + tPn + + fT„ ) = M„_i. 


Thus {Mn) is a positive-valued supermartingale adapted to (J3)- A version of Doob’s martingale 
inequality—see, for instance, page 274 of m — then says that for any m, we have Pr(sup„>^ M„ > 
(5) < {EMm)/S. Using this, we see that for any A > 0, 


Pr ( sup 17i > A ) < Pr ( sup + t„ > A I = Pr ( sup M„ > e 


„tA 


< 


EM„ 


= exp(-f(A - Tm))Ee 


tYrr, 


B.3 Proof of Lemma 12,51 

It is well known that V can be chosen by picking d values Z = {Zi,..., Zd) independently from 
the standard normal distribution and then setting V = Z/||Z||. Therefore, 

y Z^ + --- + Z^ lUl 

Zf + (Z| + • • • + ZJ) lUi + fU2 ’ 
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where Wi is drawn from a chi-squared distribution with d — 1 degrees of freedom and W 2 is drawn 
independently from a chi-squared distribution with one degree of freedom. This characterization 
implies that Y follows the Beta((ci — 1)/2,1 /2) distribution: specifically, for any 0 < y < 1, 

The moment-generating function of this distribution is 

r(^)r(i) io y y) y 

There isn’t a closed form for this, but an upper bound on the integral can be obtained. Assuming 
d > 3, 


f _ y) < [ e‘^(l - y) ^/"^dy 

Jo Jo 

Y^z-^/^dz 


e 

71 


-Vi Jo " 


^z-^/^dz = ^r(i/ 2 ), 


where the second step uses a change of variable z = f(l — y), and the fourth uses the definition of 
the gamma function. To finish up, we use the inequality r(z -f 1/2) < ^/zT(z) (Lemma fB.ll l to get 


Ee‘^ < 


r(i) e* 

r(^^) Jf 


< e' 


d- 1 


2 t 


The following inequality is doubtless standard; we give a short proof here because we are unable to 
find a reference. 


Lemma B.l. For any z > 0, 

r(^z+7^ < VzT{z). 


Proof. Suppose a random variable T > 0 is drawn according to the density Pr(T = t) (x F *. 
Let’s compute ET and KVT: 


ET = 


eVt = 


/g“ Fe *dt r(z + 1) _ _ 
F-^e-Ht ~ r(z) 

/q°° F-^/^e-*dt _ T{z +1/2) 
F~^e~*^dt r( 2 ;) 


where we have used the standard fact r( 2 : -f 1) = zr{z). By concavity of the square root function, 
we know that Ev^ < VET. This yields the lemma. □ 


B.4 Proof of Theorem I2.2I 


From Lemma lA~^ a). we have < ^n-i+Pn - Zn, Where and E[Z^\T^_,] > 0, 

and Zn lies in an interval of length = 8jnB. We can thus directly apply the first deviation bound 
of Lemma lL4l 


Since 


£>n £>n 



C 


2 


7 

n 
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we see that for any t > 0, 


E (ft+f) = E s 

i>no ^ ' l>no ^ / o 

To make this < e/d, it suffices to take Uo > B^c^d(l + 32f)/(4e), whereupon Lemma l24l vields 

Pr f sup > 1 - 3 ) < 

\n>no dj 

< e*-J = (,2et/d ^ 

~ \ 2t V 2f' 

where the last step uses Lemma 1231 The result follows by taking t = d/(Ae). 


C Intermediate epochs of improvement 
C.l Proof of Lemma |2J1 

Lemma rA.4l establishes an inequality T'n < T'n_i — + (3n as well as a lower bound on 

where Z„ is a random variable that lies in an interval of length (/„ = S'jnB. From 
Lemma 1231 we then have 

< exp(f(T'„_i -E[Z„|J-„_i] +/3„ + fC"/8)) 

< exp - 27 „(Ai - A2)4'„-i(l - 4'„-i) + + 32f)/4)) 

= exp (<(4'„_i - Co4'„-i(l - T'„_i)/n + c^B^(l + 32f)/4n^)) 

For any w G we have 4'„_i(a;) < 1 — e^. Taking expectations over we get the lemma. 


C.l Proof of Lemma I2l8l 

Let j be the largest index such that nj < n. Then 


T'„_i(u;) has value 


< 1 — Ej for uj G fl'„ 
>l-ej foruj Gn'„_^\n'^ 


Thus the expected value of over is at most the expected value over 


C.3 Proof of Lemma I2l9l 

We begin with the following Lemma. 
Lemma C.l. For any n > rij and any t > 0, 


E„[e*'^"'] < exp f(l - ej) 


n, + 1 

n + 1 


+ 


tc^B'^il + 32f) 



Proof. Define = 1 — (coej/n) and = c^B^tfl + 32t)/Anf. By Lemmas l2.7l and 12781 for 
n > rij, 

E„[e*'*'"] < E„[e‘“"'^"-i]exp(^„(<)) < E„_iexp(5„(f)). 
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By applying these inequalities repeatedly, for n shrinking to nj + 1 (and t shrinking as well), we get 
]En[e**”] < E„^+i [exp • •-an^.+i)] exp(^„(f)) exp(^„_i(fa„)) • • •exp(^„^+i(to^ 

< En.+i [exp • • • a„^+i)] exp(^„(f)) exp(^„_i(f)) • • • exp(^„^.+i(f)) 


= E. 


■rij + l 


exp (i»„, (l - 


1 - 


n — 1 


1 _ 


n-j + 1 


exp 


^B‘^t{l + 32t) f 1 1 


< exp I f(l — e^-) exp —CqE 


(n — 1)^ 

1 


rij + 1 


+ ••• + 

1 

-h - 

n 


c^BH(l + i2t) { I 1 

exp I -^^ T- tttt + • • • + 


(rij + If 


1 


4 \n? {n — If {rij + lf 

since (w) < 1 — for all w G We then use the summations 


1 


rij + 1 


1 

- > 


nn-\-l 


dx , 
— = In 


n + 1 


(rij + 1)2 


H-H-^ < 

rt ^ 


'rij+l ■ 

f" dx 


rij + 1 


1 1 


n, n 


to get the lemma. 


□ 


To prove Lemma |Z91 we note that under conditions @, 


(1 - ef) 


rij + 1 


< < l-3ej < 1 - 


ef+i “ ^3 


rij+i + 1^ 

We have used the fact that < 1 — x for 0 < x < 3/4. The rest follows by applying Lemma ICTI 
with n = rij+i. 

C.4 Proof of Lemma l2.10l 

Pick any 0 < j < J. We will mimic the reasoning of Theorem 12.21 being careful to define martin¬ 
gales only on the restricted space and with starting time rij. Then 


Pn^ [ sup T'n > 1 - Cj ) < E„^. [e*'^"i ] exp ( -f(l ^ 

. n>n. 


4n,- 


< exp ( —tcj-i 


^rij-i 

where the second step invokes Lemma lZ9l 

To finish, we pick t = (2/eo) ln(4/i5). The lower bound on rio is also a lower bound on nj_i, and 
implies that tfB^{l + 32f)/4nj_i < feo/ 2 , whereupon 


Pn^ sup T*™ > 1 - ej < exp - 


T>nn- 


tej-1 


gYj-i/<^o ^ 

< 


2 f+i' 


Summing over j then yields the lemma. 


D The final epoch 

D.l Proof of Lemma 12,111 

By Lemma|Aj4l 


E['I'„|J^ n-i] < — 27„(1 — T'„_i)(Ai — A 2 )) + fn- 


Oinj+2j) 
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For realizations w G we have < 1/2 and thus the right-hand side of the above 

expression is at most (1 — an)'ifn-i + /?»■ Using the fact that U/ is J^„_i-measurable, and taking 
expectations over 

< (1 - a„)E„[^'„_i] -f /3„ 

< (1 — + I3n, 

as claimed. The last step uses Lemma 127^ 


D.2 Proof of Theorem ll.il 

Define epochs {rij, ej) that satisfy the conditions of Theorem 12,61 with ej = 1/2, and with ej+i = 
2ej whenever possible. Then J = log 2 l/(2eo) and 


nj + 1 = (no -f 1 ) exp ( ^ j = {uo + 1) 


5/(co In 2) 


= K + i)|f^) 


5/(co In 2) 


By Theorem 12.61 with probability > 1 — i5, we have < 1/2 for all n > nj. More precisely, 
P{^n) > 1 — <5 for all n > Uq. 

By Lemma l2.11l for n > nj, 

Eni^-n] < (l - E„_i[T'„_i] + 

\ nJ 

for a = Co/2 and b = c^B^/4. By the a > 1 case of Lemma lD.il 


E„[vk„] < ( E„,[vl/„; - — 


< 


n + 1 / 
1 / Ho + I 


a — 1 


1 + 


1 \ 1 




nj + 1/ n + 1 
a + 1 ^ 1 

■ exp ■ 


2 \ n + 1 J \ J a — 1 ^ \nj + 1J n + 1 

which upon further simplification yields the bound of Theorem ll.ll for a > 1. 

(Note that the a < 1 case of Lemma ID. 1 1 yields a rate of E„ ['k„] = 0(n““).) 

Lemma D.l. Consider a nonnegative sequence {ut : t > to), such that for some constants a,b > 0 
and for all t > to > 0, 

/as b 

«.<(!- -j 


Then, writing the zeta function (^{s) = 

(to+lYuf + 

t-ri J “‘o ^ a-l ^ to + 1 J t +1 


Ut < 


(^) Ut,+AbC{a-2)jj^ 


a > 1 

a < 1 


Proof Recursively applying the given recurrence for ut yields 


Ut < 


\i=to + l 

To bound the product term, we use 


n (>-?) ••.+ E Mn U-) 


i=to + l \l=*-l-l 


ri ^ expl'-a 




i—to 


pi+l 


'to + 1 


dx 


fo + 1 
f + 1 
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Therefore, 


Ut < 



< 



b 

{t + 1 )“ 



We finish by bounding the summation of (i + 1)“ ^ by a definite integral, to get: 


t 

^ (* + l )“-2 < 
i—to + 1 


C(a - 2) 


a > 1 
a < 1 


□ 
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